Open In Colab

Introduction to Exploratory Data Analysis¶

Agenda¶

  • Loading Libraries
  • Loading Data
  • Data Analysis
  • Data Visualization

In this tutorial, you will learn one of the important step of a Data Science pipeline i.e. Exploratory Data Analysis.

Exploratory Data Analysis (EDA): In this step, we try to understand the data and the underlying interactions between different variables.

The dataset we will be using here is the wine quality data.

Data Set Information: The two datasets are related to red and white variants of the Portuguese "Vinho Verde" wine. Quality is based on sensory scores (median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (very excellent).

Loading Libraries¶

All Python capabilities are not loaded to our working environment by default (even they are already installed in your system). So, we import each and every library that we want to use.

In data science, numpy and pandas are most commonly used libraries. Numpy is required for calculations like means, medians, square roots, etc. Pandas is used for data processin and data frames. We chose alias names for our libraries for the sake of our convenience (numpy --> np and pandas --> pd).

In [1]:
import numpy as np        # Fundamental package for linear algebra and multidimensional arrays
import pandas as pd       # Data analysis and manipultion tool

Loading Data¶

Pandas module is used for reading files. We have our data in '.csv' format. We will use 'read_csv()' function for loading the data.

In [2]:
# In read_csv() function, we have passed the location to where the files are located in the UCI website. The data is separated by ';'
# so we used separator as ';' (sep = ";")
red = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv", sep=";")
white = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv", sep=";")

Let's check how our data looks. This can be done using head() method.

In [3]:
# Red Wine
red.head()    # we can also pass the number of records we want in the brackets (). By default it displays first 5 records.
Out[3]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
0 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5
1 7.8 0.88 0.00 2.6 0.098 25.0 67.0 0.9968 3.20 0.68 9.8 5
2 7.8 0.76 0.04 2.3 0.092 15.0 54.0 0.9970 3.26 0.65 9.8 5
3 11.2 0.28 0.56 1.9 0.075 17.0 60.0 0.9980 3.16 0.58 9.8 6
4 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5
In [4]:
# White Wine
white.head(6)      # we will get first 6 records from white wine data
Out[4]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
0 7.0 0.27 0.36 20.7 0.045 45.0 170.0 1.0010 3.00 0.45 8.8 6
1 6.3 0.30 0.34 1.6 0.049 14.0 132.0 0.9940 3.30 0.49 9.5 6
2 8.1 0.28 0.40 6.9 0.050 30.0 97.0 0.9951 3.26 0.44 10.1 6
3 7.2 0.23 0.32 8.5 0.058 47.0 186.0 0.9956 3.19 0.40 9.9 6
4 7.2 0.23 0.32 8.5 0.058 47.0 186.0 0.9956 3.19 0.40 9.9 6
5 8.1 0.28 0.40 6.9 0.050 30.0 97.0 0.9951 3.26 0.44 10.1 6

Le's explore the attributes / columns of the datasets. Both red and white wines have the same columns.

In [5]:
# Columns / Attribute
red.columns
Out[5]:
Index(['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality'],
      dtype='object')

Target Variable:¶

The target variable of a dataset is the feature of a dataset about which you want to gain a deeper understanding. It is the variable that is, or should be the output.

Here quality is the target variable as we're trying to know which of the two types of wines have a better quality.

Input Variables:¶

One or more variables that are used to determine (or predict) the 'Target Variable' are known as Input Variables. They are sometimes called Predictor Variable as well.

In our example, the input variables are: 'fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density','pH', 'sulphates', and 'alcohol'.

All of these will help us predict the quality of the wine.

Variables/Features:¶

Variables and features both are the same, they are often used interchangeably. All the column names in a dataset are variables.

Exploratory Data Analysis (EDA)¶

After loading the data, it is important to examine the data. It is usually not recommended to directly throw all the data into the model without understanding the data. This step always helps in improving our model.

Concatenate both the data.

In [6]:
# Add a column to separate whether the wine is red or white.
red['color'] = 'r'
white['color'] = 'w'
In [7]:
wine = pd.concat([red, white])

Let's rename the columns which contain spaces in their names.

In [8]:
# rename() function is used to rename the columns

# wine data
# red_wine_data
wine.rename(columns={'fixed acidity': 'fixed_acidity', 'citric acid':'citric_acid', 'volatile acidity':'volatile_acidity',
                          'residual sugar':'residual_sugar', 'free sulfur dioxide':'free_sulfur_dioxide', 'total sulfur dioxide':'total_sulfur_dioxide'},
                 inplace = True)

# red_wine_data
red.rename(columns={'fixed acidity': 'fixed_acidity', 'citric acid':'citric_acid', 'volatile acidity':'volatile_acidity',
                          'residual sugar':'residual_sugar', 'free sulfur dioxide':'free_sulfur_dioxide', 'total sulfur dioxide':'total_sulfur_dioxide'},
                 inplace = True)    # inplace = True makes changes in the dataframe itself

# white_wine_data
white.rename(columns={'fixed acidity': 'fixed_acidity', 'citric acid':'citric_acid', 'volatile acidity':'volatile_acidity',
                          'residual sugar':'residual_sugar', 'free sulfur dioxide':'free_sulfur_dioxide', 'total sulfur dioxide':'total_sulfur_dioxide'},
                 inplace = True)
In [9]:
red.head(2)
Out[9]:
fixed_acidity volatile_acidity citric_acid residual_sugar chlorides free_sulfur_dioxide total_sulfur_dioxide density pH sulphates alcohol quality color
0 7.4 0.70 0.0 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5 r
1 7.8 0.88 0.0 2.6 0.098 25.0 67.0 0.9968 3.20 0.68 9.8 5 r
In [10]:
# concise summary about dataset
red.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed_acidity         1599 non-null   float64
 1   volatile_acidity      1599 non-null   float64
 2   citric_acid           1599 non-null   float64
 3   residual_sugar        1599 non-null   float64
 4   chlorides             1599 non-null   float64
 5   free_sulfur_dioxide   1599 non-null   float64
 6   total_sulfur_dioxide  1599 non-null   float64
 7   density               1599 non-null   float64
 8   pH                    1599 non-null   float64
 9   sulphates             1599 non-null   float64
 10  alcohol               1599 non-null   float64
 11  quality               1599 non-null   int64  
 12  color                 1599 non-null   object 
dtypes: float64(11), int64(1), object(1)
memory usage: 162.5+ KB
In [11]:
white.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4898 entries, 0 to 4897
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed_acidity         4898 non-null   float64
 1   volatile_acidity      4898 non-null   float64
 2   citric_acid           4898 non-null   float64
 3   residual_sugar        4898 non-null   float64
 4   chlorides             4898 non-null   float64
 5   free_sulfur_dioxide   4898 non-null   float64
 6   total_sulfur_dioxide  4898 non-null   float64
 7   density               4898 non-null   float64
 8   pH                    4898 non-null   float64
 9   sulphates             4898 non-null   float64
 10  alcohol               4898 non-null   float64
 11  quality               4898 non-null   int64  
 12  color                 4898 non-null   object 
dtypes: float64(11), int64(1), object(1)
memory usage: 497.6+ KB

It is always interesting to know the basic statistical characteristics of each numerical variables.

In [12]:
# Basic Statistical details 
red.describe()
Out[12]:
fixed_acidity volatile_acidity citric_acid residual_sugar chlorides free_sulfur_dioxide total_sulfur_dioxide density pH sulphates alcohol quality
count 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000 1599.000000
mean 8.319637 0.527821 0.270976 2.538806 0.087467 15.874922 46.467792 0.996747 3.311113 0.658149 10.422983 5.636023
std 1.741096 0.179060 0.194801 1.409928 0.047065 10.460157 32.895324 0.001887 0.154386 0.169507 1.065668 0.807569
min 4.600000 0.120000 0.000000 0.900000 0.012000 1.000000 6.000000 0.990070 2.740000 0.330000 8.400000 3.000000
25% 7.100000 0.390000 0.090000 1.900000 0.070000 7.000000 22.000000 0.995600 3.210000 0.550000 9.500000 5.000000
50% 7.900000 0.520000 0.260000 2.200000 0.079000 14.000000 38.000000 0.996750 3.310000 0.620000 10.200000 6.000000
75% 9.200000 0.640000 0.420000 2.600000 0.090000 21.000000 62.000000 0.997835 3.400000 0.730000 11.100000 6.000000
max 15.900000 1.580000 1.000000 15.500000 0.611000 72.000000 289.000000 1.003690 4.010000 2.000000 14.900000 8.000000
In [13]:
white.describe()
Out[13]:
fixed_acidity volatile_acidity citric_acid residual_sugar chlorides free_sulfur_dioxide total_sulfur_dioxide density pH sulphates alcohol quality
count 4898.000000 4898.000000 4898.000000 4898.000000 4898.000000 4898.000000 4898.000000 4898.000000 4898.000000 4898.000000 4898.000000 4898.000000
mean 6.854788 0.278241 0.334192 6.391415 0.045772 35.308085 138.360657 0.994027 3.188267 0.489847 10.514267 5.877909
std 0.843868 0.100795 0.121020 5.072058 0.021848 17.007137 42.498065 0.002991 0.151001 0.114126 1.230621 0.885639
min 3.800000 0.080000 0.000000 0.600000 0.009000 2.000000 9.000000 0.987110 2.720000 0.220000 8.000000 3.000000
25% 6.300000 0.210000 0.270000 1.700000 0.036000 23.000000 108.000000 0.991723 3.090000 0.410000 9.500000 5.000000
50% 6.800000 0.260000 0.320000 5.200000 0.043000 34.000000 134.000000 0.993740 3.180000 0.470000 10.400000 6.000000
75% 7.300000 0.320000 0.390000 9.900000 0.050000 46.000000 167.000000 0.996100 3.280000 0.550000 11.400000 6.000000
max 14.200000 1.100000 1.660000 65.800000 0.346000 289.000000 440.000000 1.038980 3.820000 1.080000 14.200000 9.000000

Let's explore different statistical measures that we have got from desribe().

  • count: total count of non-null values in the column
  • mean: the average of all the values in that column
  • min: the minimum value in the column
  • max: the maximum value in the column
  • 25%: first quartile in the column after we arrange those values in ascending order
  • 50%: this is the median or the second quartile
  • 75%: the third quartile
  • std: this is the standard deviation (i.e. measure of depreciation, you must have read in the basics of statistics study material)

Note: 25%, 50%, and 75% are nothing but corresponding percentile values

Our brains are good at spotting patterns in pictures. Let's play around different types of data visualizations.

In [14]:
# first import data visualizations libraries
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib
# To ignore warnings
import warnings
warnings.filterwarnings('ignore')
In [46]:
# red wines      
red.hist(bins=10, figsize=(16,12), grid = False)
plt.show()

The distribution of the attribute seems to be positively skewed. The attributes 'density' and 'pH' are quite normally distributed (you must have read about normal distribution in basic statistics study material). Now looking at the attribute quality, we can observe that the wines with average quality (i.e. quality rating 5 to 7) are more than bad and good quality of wines.

In [45]:
# white wine
white.hist(bins=10, figsize=(16,12), grid = False)
plt.show()

For white wines, the attribute 'pH' is quite normally distributed. The average quality of wines are more than good and bad qualities of wines for white wine data too. Most of the wines seem to be containing alcohol percentage in the range 8.5% to nearly 13%.

We can use pivot tables to observe the values of different features for each quality of wines.

In [17]:
# Creating pivot table for red wine
columns = list(red.columns).remove('quality')
red.pivot_table(columns, ['quality'], aggfunc=np.median)    # By default the aggfunc is mean
Out[17]:
alcohol chlorides citric_acid density fixed_acidity free_sulfur_dioxide pH residual_sugar sulphates total_sulfur_dioxide volatile_acidity
quality
3 9.925 0.0905 0.035 0.997565 7.50 6.0 3.39 2.1 0.545 15.0 0.845
4 10.000 0.0800 0.090 0.996500 7.50 11.0 3.37 2.1 0.560 26.0 0.670
5 9.700 0.0810 0.230 0.997000 7.80 15.0 3.30 2.2 0.580 47.0 0.580
6 10.500 0.0780 0.260 0.996560 7.90 14.0 3.32 2.2 0.640 35.0 0.490
7 11.500 0.0730 0.400 0.995770 8.80 11.0 3.28 2.3 0.740 27.0 0.370
8 12.150 0.0705 0.420 0.994940 8.25 7.5 3.23 2.1 0.740 21.5 0.370

For each quality of red wines, we can observe the median values of different features.

In [18]:
# Creating pivot table for red wine
columns = list(white.columns).remove('quality')
white.pivot_table(columns, ['quality'], aggfunc=np.median)
Out[18]:
alcohol chlorides citric_acid density fixed_acidity free_sulfur_dioxide pH residual_sugar sulphates total_sulfur_dioxide volatile_acidity
quality
3 10.45 0.041 0.345 0.994425 7.3 33.5 3.215 4.60 0.44 159.5 0.26
4 10.10 0.046 0.290 0.994100 6.9 18.0 3.160 2.50 0.47 117.0 0.32
5 9.50 0.047 0.320 0.995300 6.8 35.0 3.160 7.00 0.47 151.0 0.28
6 10.50 0.043 0.320 0.993660 6.8 34.0 3.180 5.30 0.48 132.0 0.25
7 11.40 0.037 0.310 0.991760 6.7 33.0 3.200 3.65 0.48 122.0 0.25
8 12.00 0.036 0.320 0.991640 6.8 35.0 3.230 4.30 0.46 122.0 0.26
9 12.50 0.031 0.360 0.990300 7.1 28.0 3.280 2.20 0.46 119.0 0.27

We can check how each features are reated with others using corr() function.

The correlation value ranges between -1 to 1.When it is close to 1, it means that there is a strong positive correlation. When the coefficient is close to –1, it means that there is a strong negative correlation. Finally, coefficients close to zero mean that there is no linear correlation. We can observe the detail information using correlation matrix

In [19]:
# red wines
red.corr()
Out[19]:
fixed_acidity volatile_acidity citric_acid residual_sugar chlorides free_sulfur_dioxide total_sulfur_dioxide density pH sulphates alcohol quality
fixed_acidity 1.000000 -0.256131 0.671703 0.114777 0.093705 -0.153794 -0.113181 0.668047 -0.682978 0.183006 -0.061668 0.124052
volatile_acidity -0.256131 1.000000 -0.552496 0.001918 0.061298 -0.010504 0.076470 0.022026 0.234937 -0.260987 -0.202288 -0.390558
citric_acid 0.671703 -0.552496 1.000000 0.143577 0.203823 -0.060978 0.035533 0.364947 -0.541904 0.312770 0.109903 0.226373
residual_sugar 0.114777 0.001918 0.143577 1.000000 0.055610 0.187049 0.203028 0.355283 -0.085652 0.005527 0.042075 0.013732
chlorides 0.093705 0.061298 0.203823 0.055610 1.000000 0.005562 0.047400 0.200632 -0.265026 0.371260 -0.221141 -0.128907
free_sulfur_dioxide -0.153794 -0.010504 -0.060978 0.187049 0.005562 1.000000 0.667666 -0.021946 0.070377 0.051658 -0.069408 -0.050656
total_sulfur_dioxide -0.113181 0.076470 0.035533 0.203028 0.047400 0.667666 1.000000 0.071269 -0.066495 0.042947 -0.205654 -0.185100
density 0.668047 0.022026 0.364947 0.355283 0.200632 -0.021946 0.071269 1.000000 -0.341699 0.148506 -0.496180 -0.174919
pH -0.682978 0.234937 -0.541904 -0.085652 -0.265026 0.070377 -0.066495 -0.341699 1.000000 -0.196648 0.205633 -0.057731
sulphates 0.183006 -0.260987 0.312770 0.005527 0.371260 0.051658 0.042947 0.148506 -0.196648 1.000000 0.093595 0.251397
alcohol -0.061668 -0.202288 0.109903 0.042075 -0.221141 -0.069408 -0.205654 -0.496180 0.205633 0.093595 1.000000 0.476166
quality 0.124052 -0.390558 0.226373 0.013732 -0.128907 -0.050656 -0.185100 -0.174919 -0.057731 0.251397 0.476166 1.000000

From the above correlation matrix, we can observe that there is a relatively high positive correlation between fixed_acidity and citric_acid, fixed_acidity and density. Similarly we can observe there is a relatively high negative correlation between fixed_acidity and pH. There is relatively high positive correlation between alcohol presence and quality of the wines.

In [20]:
# white wines
white.corr()
Out[20]:
fixed_acidity volatile_acidity citric_acid residual_sugar chlorides free_sulfur_dioxide total_sulfur_dioxide density pH sulphates alcohol quality
fixed_acidity 1.000000 -0.022697 0.289181 0.089021 0.023086 -0.049396 0.091070 0.265331 -0.425858 -0.017143 -0.120881 -0.113663
volatile_acidity -0.022697 1.000000 -0.149472 0.064286 0.070512 -0.097012 0.089261 0.027114 -0.031915 -0.035728 0.067718 -0.194723
citric_acid 0.289181 -0.149472 1.000000 0.094212 0.114364 0.094077 0.121131 0.149503 -0.163748 0.062331 -0.075729 -0.009209
residual_sugar 0.089021 0.064286 0.094212 1.000000 0.088685 0.299098 0.401439 0.838966 -0.194133 -0.026664 -0.450631 -0.097577
chlorides 0.023086 0.070512 0.114364 0.088685 1.000000 0.101392 0.198910 0.257211 -0.090439 0.016763 -0.360189 -0.209934
free_sulfur_dioxide -0.049396 -0.097012 0.094077 0.299098 0.101392 1.000000 0.615501 0.294210 -0.000618 0.059217 -0.250104 0.008158
total_sulfur_dioxide 0.091070 0.089261 0.121131 0.401439 0.198910 0.615501 1.000000 0.529881 0.002321 0.134562 -0.448892 -0.174737
density 0.265331 0.027114 0.149503 0.838966 0.257211 0.294210 0.529881 1.000000 -0.093591 0.074493 -0.780138 -0.307123
pH -0.425858 -0.031915 -0.163748 -0.194133 -0.090439 -0.000618 0.002321 -0.093591 1.000000 0.155951 0.121432 0.099427
sulphates -0.017143 -0.035728 0.062331 -0.026664 0.016763 0.059217 0.134562 0.074493 0.155951 1.000000 -0.017433 0.053678
alcohol -0.120881 0.067718 -0.075729 -0.450631 -0.360189 -0.250104 -0.448892 -0.780138 0.121432 -0.017433 1.000000 0.435575
quality -0.113663 -0.194723 -0.009209 -0.097577 -0.209934 0.008158 -0.174737 -0.307123 0.099427 0.053678 0.435575 1.000000

We can plot the above correlation matrix using heatmaps too. The visualization using heatmap is a pictorial visualization.

In [21]:
# Custom colour map based on Netflix palette
cmap = matplotlib.colors.LinearSegmentedColormap.from_list("", ['#221f1f', '#0045a4','#0e4f66'])

def Heatmap(df):  
    corr = df.corr()
    mask = np.zeros_like(corr, dtype= bool)
    mask[np.triu_indices_from(mask)] = True
    fig, ax = plt.subplots(figsize=(10, 7), dpi = 100)
    fig.text(.35,.93,'Correlation Coefficient', fontfamily='serif',fontweight='bold',fontsize=15)
    pl = sns.heatmap(corr, mask = mask, annot = True, fmt = '.2f',cbar = True,
                     cmap=cmap, vmax=.3, vmin=-.3, center=0, square=True, linewidths=2.5)
    plt.tick_params(which=u'both', axis=u'both', length=0)
    plt.tight_layout()
    
    # Show the plot
    plt.show()
In [22]:
# Heat map of red wine
Heatmap(red)
In [23]:
Heatmap(white)
In [38]:
def fig():
    rc = {'figure.figsize':(10,6),
      'figure.dpi': 80,
      'axes.facecolor':'white',
      'axes.grid' : False,
      'font.family':'serif',
      'font.size' : 12}
    plt.rcParams.update(rc)
    
# create a function to label axes and set title
def label(xaxis, yaxis, title):
    """This function set the label axes and title of plots"""
    plt.xlabel(xaxis)
    plt.ylabel(yaxis)
    plt.title(title, fontweight = 'bold', fontsize = 15)
    
# Fuction to remove borders and clean the axes
def remove_borders():
    g = sns.despine(top =True, left=True, right=True, bottom=True)
    plt.tick_params(axis = u'both', which=u'both', length=0)

Discrete Categorical Attributes¶

The attribute quality is categorical in nature and we can visualize this type of attributes using barplot or countplot.

In [40]:
# Countplot for quality of wines present in different category of wines (red and white)
fig()
sns.countplot(x = wine.quality, hue=wine.color, 
              width = 0.6)
# Add title and label axis
label('Quality', 'Count', 'Quality of wine')

# Remove spines
remove_borders()

# Show the plot
plt.show()

From above countplot, we can observe that the average quality of wines are more than good and bad quality of wines in both variants of wines.

We can visualize scatterplot matrix for the better understanding relationship between a pair of variables. It plots every numerical attribute against every other. 'pairplot' of seaborn helps to achieve this

In [43]:
# red wine
sns.pairplot(red)
plt.show()

The correlation between fixed_acidity and citric_acid is 0.67 (you could find this value in the correlation matrix of red wines). Looking for scatterplot for this pair of variables, we can see the positive linear correlation between these two variables. We can observe the upward trend, and also the points are not too dispersed.

Similarly we can plot scatter plot between variables for white wines.

In [44]:
# Pairplot
sns.pairplot(white)

plt.show()